# Image-Text Generation
Qwen Qwen2.5 VL 72B Instruct GGUF
Other
A quantized version of the Qwen2.5-VL-72B-Instruct multimodal large language model for image-text-to-text tasks, offered at quantization levels ranging from high precision down to low memory footprints.
Image-to-Text English
bartowski
1,336
1
Jedi 7B 1080p
Apache-2.0
Jedi-7B-1080p is a multimodal model built on Qwen2.5-VL-7B-Instruct, supporting joint processing of images and text for vision-language tasks.
Image-to-Text
Safetensors English
xlangai
239
2
UI TARS 1.5 7B 4bit
Apache-2.0
UI-TARS-1.5-7B-4bit is a 4-bit quantized multimodal model for image-text-to-text tasks, with English language support.
Image-to-Text
Transformers Supports Multiple Languages

mlx-community
184
1
Internvl3 1B Hf
Other
InternVL3 is an advanced series of multimodal large language models, demonstrating exceptional multimodal perception and reasoning capabilities, supporting image, video, and text inputs.
Image-to-Text
Transformers Other

OpenGVLab
1,844
2
Barcenas 4b
A multimodal model fine-tuned from google/gemma-3-4b-it on high-quality data spanning mathematics, programming, science, and puzzle solving.
Image-to-Text
Transformers English

Danielbrdz
15
2
Gemma 3 4b It GPTQ 4b 128g
An INT4-quantized version of the gemma-3-4b-it model that significantly reduces storage and compute requirements.
Image-to-Text
Transformers

ISTA-DASLab
502
2
Qwen2.5 VL 7B Instruct Gptqmodel Int8
MIT
A GPTQ-INT8 quantized version of the Qwen2.5-VL-7B-Instruct vision-language model.
Image-to-Text
Transformers Supports Multiple Languages

wanzhenchn
101
0
Gemma 3 12b It Qat Q4 0 Unquantized
Gemma 3 is Google's lightweight open-source multimodal model series based on Gemini technology, supporting text and image inputs with text outputs. The 12B version undergoes instruction tuning and quantization-aware training (QAT), making it suitable for deployment in resource-limited environments.
Image-to-Text
Transformers

google
1,159
10
Vora 7B Instruct
VoRA is a 7B-parameter vision-language model focused on image-text-to-text tasks.
Image-to-Text
Transformers

Hon-Wong
154
12
Vora 7B Base
VoRA is a 7B-parameter vision-language model capable of processing image and text inputs to generate text outputs.
Image-to-Text
Transformers

Hon-Wong
62
4
Qwen2.5 VL 7B Instruct Q4 K M GGUF
Apache-2.0
This is the GGUF-quantized version of the Qwen2.5-VL-7B-Instruct model, suitable for multimodal tasks with both image and text inputs.
Image-to-Text English
PatataAliena
69
1
Qwen2.5 VL 7B Instruct GGUF
Apache-2.0
Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image understanding and text generation tasks.
Image-to-Text English
Mungert
17.10k
10
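For GGUF repositories like this one, a common first step is to pull a single quantized file rather than clone the whole repo. Below is a minimal sketch using the huggingface_hub client; the repository id is inferred from this entry's title and owner, so verify it (and the available filenames) before running.

```python
from huggingface_hub import hf_hub_download, list_repo_files

# Repository id inferred from the entry's title and owner (verify before use).
repo_id = "Mungert/Qwen2.5-VL-7B-Instruct-GGUF"

# List the quantized files in the repo, then download one of them.
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
print(gguf_files)

local_path = hf_hub_download(repo_id=repo_id, filename=gguf_files[0])
print(local_path)  # hand this path to a GGUF runtime such as llama.cpp
```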
Heron NVILA Lite 1B
Apache-2.0
A Japanese vision-language model built on the NVILA-Lite architecture, supporting image-text interaction in both Japanese and English.
Image-to-Text
Safetensors Supports Multiple Languages
turing-motors
460
2
Qwen.qwen2.5 VL 72B Instruct GGUF
Qwen2.5-VL-72B-Instruct is a large-scale vision-language model developed by the Tongyi Qianwen team, supporting multimodal understanding and generation of images and text.
Image-to-Text
DevQuasar
281
0
Gemma 3 4b Pt Qat Q4 0 Gguf
Gemma 3 is a lightweight open model series launched by Google, built on the same technology as Gemini, supporting multimodal input and text output.
Image-to-Text
google
912
16
Chameleon 7b
Other
A 7B-parameter multimodal model from Meta's Chameleon series, supporting image-text-to-text tasks.
Large Language Model
FriendliAI
24
1
Toriigate V0.4 7B GGUF
Apache-2.0
A static quantized version of ToriiGate-v0.4-7B, suitable for multimodal, vision, and image-text-to-text tasks.
Image-to-Text
Transformers English

mradermacher
668
0
Internvl2 5 4B AWQ
MIT
InternVL2_5-4B-AWQ is the AWQ quantized version of InternVL2_5-4B using autoawq, supporting multilingual and multimodal tasks.
Image-to-Text
Transformers Other

rootonchair
29
2
Gemma 3 4b It
Gemma is a lightweight, advanced open model series launched by Google, built on the same research and technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.
Image-to-Text
Transformers

google
608.22k
477
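As a rough illustration of how an entry like this, tagged Image-to-Text with Transformers support, is typically driven, here is a minimal sketch using the transformers image-text-to-text pipeline. It assumes a transformers release recent enough to include Gemma 3 and that pipeline task; the image URL and prompt are placeholders.

```python
from transformers import pipeline

# Minimal sketch: send an image plus a text prompt through a chat-style multimodal model.
# Assumes a recent transformers release with Gemma 3 support and the image-text-to-text pipeline.
pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```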
Qwen2 VL 2B Instruct GGUF
Apache-2.0
Qwen2-VL-2B-Instruct is a multimodal vision-language model that supports interaction between images and text, suitable for image understanding and text generation tasks.
Image-to-Text English
gaianet
95
1
Qwen2 VL 7B Instruct GGUF
Apache-2.0
Qwen2-VL-7B-Instruct is a multimodal vision-language model that supports joint understanding and generation tasks for images and text.
Image-to-Text English
second-state
195
4
Minivla Libero90 Prismatic
MIT
MiniVLA is a 1-billion-parameter vision-language-action model compatible with the Prismatic Vision-Language Model codebase, suitable for robotics and multimodal tasks.
Image-to-Text
Transformers English

Stanford-ILIAD
127
0
P MoD LLaVA NeXT 7B
Apache-2.0
p-MoD is a Mixture-of-Depths multimodal large language model built with a progressive ratio decay strategy, supporting image-text-to-text generation tasks.
Image-to-Text
MCG-NJU
74
4
Paligemma2 10b Pt 224
PaliGemma 2 is a vision-language model (VLM) that builds on the capabilities of the Gemma 2 model. It can process both image and text inputs simultaneously and generate text outputs, supporting multiple languages. It is suitable for various vision-language tasks such as image and short video captioning, visual question answering, text reading, object detection, and object segmentation.
Image-to-Text
Transformers

google
3,362
8
Paligemma2 10b Mix 224
PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for various vision-language tasks.
Image-to-Text
Transformers

google
701
7
Xgen Mm Phi3 Mini Instruct Interleave R V1.5
Apache-2.0
xGen-MM is a series of large multimodal models (LMMs) developed by Salesforce AI Research, building on the proven design of the BLIP series with foundational enhancements for a more robust model base.
Image-to-Text
Safetensors English
Salesforce
7,373
51
Llava MORE Llama 3 1 8B Finetuning
Apache-2.0
LLaVA-MORE is an enhanced version of the LLaVA architecture that integrates LLaMA 3.1 as its language model, focusing on image-to-text tasks.
Image-to-Text
Transformers

aimagelab
215
9
Llama 3.1 8B Vision 378
This project adds visual capabilities to Llama-3.1-8B-Instruct by training a projection module over SigLIP image features.
Image-to-Text
Transformers

qresearch
203
35
Florence 2 Large Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers

microsoft
269.44k
349
Florence 2 Large
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers

microsoft
579.23k
1,530
Paligemma 3b Mix 224
PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs with text outputs.
Image-to-Text
Transformers

google
143.03k
75
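PaliGemma's mix checkpoints are designed to be prompted directly with short task strings such as "caption en". The sketch below shows the direct processor-plus-model route in transformers; the image URL is a placeholder, and access to the checkpoint may require accepting its license on the Hub.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Direct processor + model usage (a sketch; the prompt and image URL are placeholders).
model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)
inputs = processor(text="caption en", images=image, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=30, do_sample=False)
# Drop the prompt tokens before decoding so only the generated caption remains.
caption = processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(caption)
```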
Llava Llama 3 8b V1 1 Q3 K S GGUF
This model is a GGUF-format conversion of xtuner/llava-llama-3-8b-v1_1, supporting multimodal processing of images and text.
Image-to-Text
djward888
17
1
Llava Llama 3 8b V1 1 Q5 K M GGUF
This model is a GGUF-format version of xtuner/llava-llama-3-8b-v1_1 for the llama.cpp framework, supporting image-text-to-text tasks.
Image-to-Text
djward888
20
2
Llava Llama 3 8b V1 1 Q4 K M GGUF
This model is a GGUF-format conversion of xtuner/llava-llama-3-8b-v1_1, supporting multimodal interaction between images and text.
Image-to-Text
RaincloudAi
51
1
Moai 7B
MIT
MoAI is a large-scale language and vision hybrid model capable of processing both image and text inputs to generate text outputs.
Image-to-Text
Transformers

BK-Lee
183
45
Llava V1.6 Vicuna 7b Gguf
Apache-2.0
LLaVA is an open-source multimodal chatbot trained by fine-tuning an LLM on multimodal instruction-following data. This release is the GGUF-quantized version and offers multiple quantization options.
Image-to-Text
cjpais
493
5
Llava V1.5 7b Gguf
LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Image-to-Text
granddad
13
0
Image Captioning With Blip
BSD-3-Clause
BLIP is a unified vision-language pretraining framework that excels at tasks such as image caption generation, supporting both conditional and unconditional text generation.
Image-to-Text
Transformers

Vidensogende
16
0
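BLIP-style captioners are commonly exercised through the long-standing image-to-text pipeline in transformers. A minimal sketch follows; it uses the widely published Salesforce/blip-image-captioning-base checkpoint for illustration rather than this specific repository, and the image URL is a placeholder.

```python
from transformers import pipeline

# Unconditional caption generation with a BLIP checkpoint via the image-to-text pipeline.
# The checkpoint id and image URL are illustrative placeholders.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

result = captioner("https://example.com/photo.jpg", max_new_tokens=30)
print(result[0]["generated_text"])
```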
Idefics 80b
Other
IDEFICS-80B is an 80-billion-parameter multimodal model capable of processing both image and text inputs to generate text outputs. It is an open-source reproduction of DeepMind's Flamingo model.
Image-to-Text
Transformers English

HuggingFaceM4
70
70